Clustering regions in PL, CZ, and SE based on similarities in population and/or density and/or geographical proximity
Note that clustering is not done separately for SE and PL/CZ groups of countries, but this too can be done
Visualising the clusters of regions on maps, as Martin has done, is probably desirable, but requires Martin's assistance, as the related package function doesn't seem to work on non-Windows operating systems like mine
A list of all regions in PL, CZ, and SE was pulled; as in Martin's research project paper, regions in CZ and SE are at NUTS3 level, whereas all regions in PL are at NUTS2 level, except the capital region, which is at NUTS1 level
Apart from the above-mentioned NUTS IDs, each region has the following attributes: population, area, and (geographical) centroid
import pandas as pd
import numpy as np
regions = pd.read_csv("./data/regions.csv")
Note that haversine (great-circle) distance between region centroids has been employed here. This can of course be changed to Euclidean distance
Also, the distances have been scaled to [0, 1] by dividing by the largest pairwise distance
# geographical distance matrix
from haversine import haversine # great-circle distance in km; expects (latitude, longitude) pairs
centroids = dict(zip(regions.iloc[:,0],
                     regions.iloc[:,4].apply(lambda x: [float(i) for i in x.split(',')]))) # set up
geog_dist = pd.DataFrame([[haversine(centroids[reg_row], centroids[reg_col])
                           for reg_col in regions['NUTS3'].unique()]
                          for reg_row in regions['NUTS3'].unique()],
                         columns = regions['NUTS3'].unique()).rename(
                             dict(enumerate(regions['NUTS3'].unique()))) # actual haversine distance
geog_dist = geog_dist / geog_dist.max().max() # scaled
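For reference, the haversine distance computed by the package can be sketched in plain Python. This is a minimal illustration, not the package's own implementation: the function name `haversine_km`, the mean Earth radius constant, and the assumption that each centroid is a (latitude, longitude) pair in degrees are all illustrative.

```python
import math

def haversine_km(point_a, point_b, radius_km=6371.0088):
    """Great-circle distance in km between two (lat, lon) points given in degrees.

    Assumption: coordinates are ordered (latitude, longitude); swap them first
    if the centroid column stores (longitude, latitude).
    """
    lat1, lon1 = (math.radians(v) for v in point_a)
    lat2, lon2 = (math.radians(v) for v in point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Two points on the equator, half a globe apart (illustrative values)
half_circumference = haversine_km((0.0, 0.0), (0.0, 180.0))
```

If the centroid strings turn out to be stored as "lon,lat", the distances will be wrong without raising any error, so the ordering is worth checking against one known pair of regions.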
Base population dissimilarity between a pair of regions was taken as the absolute difference between their populations
These dissimilarities were then scaled down to [0, 1]
# population dissimilarity matrix
population = dict(zip(regions.iloc[:,0], regions.iloc[:,2])) # set up
pop_diss = pd.DataFrame([[abs(population[reg_row] - population[reg_col])
                          for reg_col in regions['NUTS3'].unique()]
                         for reg_row in regions['NUTS3'].unique()],
                        columns = regions['NUTS3'].unique()).rename(
                            dict(enumerate(regions['NUTS3'].unique()))) # actual pop difference
pop_diss = pop_diss / pop_diss.max().max() # scaled
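The same kind of matrix can be built without nested loops by using NumPy broadcasting. The sketch below uses a hypothetical helper, `pairwise_abs_diff`, and made-up toy values; it is meant as an equivalent formulation of the step above, not as the project's code.

```python
import numpy as np
import pandas as pd

def pairwise_abs_diff(values: pd.Series) -> pd.DataFrame:
    """Pairwise absolute differences of a labelled series, scaled to [0, 1]."""
    arr = values.to_numpy(dtype=float)
    # arr[:, None] is a column, arr[None, :] is a row; subtracting broadcasts to n x n
    diff = np.abs(arr[:, None] - arr[None, :])
    max_diff = diff.max()
    if max_diff > 0:  # avoid dividing by zero when all values are equal
        diff = diff / max_diff
    return pd.DataFrame(diff, index=values.index, columns=values.index)

# Toy populations keyed by made-up region IDs
toy_population = pd.Series({"R1": 100.0, "R2": 300.0, "R3": 500.0})
toy_diss = pairwise_abs_diff(toy_population)
```

The same helper would also cover the population density case if it is passed a series of densities indexed by region ID.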
Population density is defined here as population per unit of area
Base population density dissimilarity between a pair of regions was taken as the absolute difference between their population densities
These dissimilarities were then scaled down to [0, 1]
# population density dissimilarity matrix
population_density = dict(zip(regions.iloc[:,0], regions.iloc[:,2]/regions.iloc[:,3])) # set up
pop_dens_diss = pd.DataFrame([[abs(population_density[reg_row] - population_density[reg_col])
                               for reg_col in regions['NUTS3'].unique()]
                              for reg_row in regions['NUTS3'].unique()],
                             columns = regions['NUTS3'].unique()).rename(
                                 dict(enumerate(regions['NUTS3'].unique()))) # actual pop density difference
pop_dens_diss = pop_dens_diss / pop_dens_diss.max().max() # scaled
# adjacency (dis)similarity matrix
import sys
sys.path.append("src")
import location
adj_matrix, adj_matrix_nuts = location.adjacency_similarity_matrix() # load adjacency (dis)similarity matrix
adj_matrix = pd.DataFrame(adj_matrix, columns = adj_matrix_nuts).rename(dict(enumerate(adj_matrix_nuts))) # format into pandas data frame
adj_matrix = adj_matrix.reindex(index=regions['NUTS3'].unique(), columns=regions['NUTS3'].unique()) # shuffle labels to match other matrices
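Because `reindex` silently inserts NaN for any label that is absent from the original index or columns, it may be worth checking the adjacency matrix before combining it with the others. The helper below is a hypothetical sanity check (its name and the toy data are illustrative) and is not part of the `location` module.

```python
import numpy as np
import pandas as pd

def matrix_problems(matrix: pd.DataFrame, labels) -> list:
    """Return a list of human-readable problems found in a labelled square matrix."""
    problems = []
    labels = list(labels)
    if list(matrix.index) != labels or list(matrix.columns) != labels:
        problems.append("row/column labels do not match the expected order")
    values = matrix.to_numpy(dtype=float)
    if np.isnan(values).any():
        problems.append("matrix contains NaN (labels missing before reindex?)")
    elif values.shape[0] == values.shape[1] and not np.allclose(values, values.T):
        problems.append("matrix is not symmetric")
    return problems

# Toy example: reindexing onto a label "C" that the matrix never had
toy = pd.DataFrame([[0.0, 1.0], [1.0, 0.0]], index=["A", "B"], columns=["A", "B"])
toy_reindexed = toy.reindex(index=["A", "B", "C"], columns=["A", "B", "C"])
```

In the notebook this could be called as `matrix_problems(adj_matrix, regions['NUTS3'].unique())`; an empty list means none of these particular problems were found.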
We build composite distance matrices by combining n (n >= 1) of the distance matrices built above
Specifically, for a given pair of regions, we treat the distance from each component matrix as one coordinate of an n-dimensional vector and compute the composite distance between the pair as the Euclidean norm of this vector
# composite distance matrices
all_four_distances = np.sqrt(pop_diss**2 + pop_dens_diss**2 + geog_dist**2 + adj_matrix**2)
all_except_geog = np.sqrt(pop_diss**2 + pop_dens_diss**2 + adj_matrix**2)
all_except_adj = np.sqrt(pop_diss**2 + pop_dens_diss**2 + geog_dist**2)
all_except_pop = np.sqrt(adj_matrix**2 + pop_dens_diss**2 + geog_dist**2)
pop_pop_dens = np.sqrt(pop_diss**2 + pop_dens_diss**2)
pop_dens = pop_dens_diss
pop_dens_adj = np.sqrt(pop_dens_diss**2 + adj_matrix**2)
pop_dens_geog = np.sqrt(pop_dens_diss**2 + geog_dist**2)
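The combinations above all follow one pattern, which can be written as a single function. This is a sketch under the assumption that every component matrix shares the same row and column labels; the name `composite_distance` and the toy matrices are illustrative.

```python
import numpy as np
import pandas as pd

def composite_distance(*components: pd.DataFrame) -> pd.DataFrame:
    """Euclidean norm across n component distance matrices, element by element."""
    if not components:
        raise ValueError("at least one component matrix is required")
    # pandas aligns on labels, so all components must share index and columns
    squared_sum = sum(component ** 2 for component in components)
    return np.sqrt(squared_sum)

# Toy components with made-up values (a 3-4-5 triangle keeps the arithmetic easy)
toy_a = pd.DataFrame([[0.0, 3.0], [3.0, 0.0]], index=["A", "B"], columns=["A", "B"])
toy_b = pd.DataFrame([[0.0, 4.0], [4.0, 0.0]], index=["A", "B"], columns=["A", "B"])
toy_composite = composite_distance(toy_a, toy_b)
```

One consequence of this construction: if each of the n components lies in [0, 1], the composite lies in [0, sqrt(n)], so it can exceed 1 whenever n > 1. That matters for any later step that assumes distances are bounded by 1.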
It seems that the preferred way to cluster regions in this project may be to use Czekanowski diagrams
So I tried to import and use the 'RMaCzek' package here in a Python environment for the job, but ran into issues. See error screenshot
# export data files to csv
all_four_distances.to_csv('./data/clustering_distance_datasets/all_four_distances.csv')
all_except_geog.to_csv('./data/clustering_distance_datasets/all_except_geog.csv')
all_except_adj.to_csv('./data/clustering_distance_datasets/all_except_adj.csv')
all_except_pop.to_csv('./data/clustering_distance_datasets/all_except_pop.csv')
pop_pop_dens.to_csv('./data/clustering_distance_datasets/pop_pop_dens.csv')
pop_dens.to_csv('./data/clustering_distance_datasets/pop_dens.csv')
pop_dens_adj.to_csv('./data/clustering_distance_datasets/pop_dens_adj.csv')
pop_dens_geog.to_csv('./data/clustering_distance_datasets/pop_dens_geog.csv')
7.1. Clustering based on composite distance using all four components (geographical, adjacency, population and population density) with 'n_class = 6':
7.2. Clustering based on composite distance using three components viz. adjacency, population and population density with 'n_class = 6':
7.3. Clustering based on composite distance using three components viz. geographical, population and population density with 'n_class = 6':
7.4. Clustering based on composite distance using three components viz. geographical, adjacency and population density with 'n_class = 6':
7.5. Clustering based on composite distance using two components viz. population and population density with 'n_class = 6':
7.6. Clustering based on composite distance using population density alone, with 'n_class = 6':
7.7. Clustering based on composite distance using two components viz. population density and adjacency with 'n_class = 6':
7.8. Clustering based on composite distance using two components viz. population density and geographical distance with 'n_class = 6':
8.1. Clustering based on composite distance using all four components (geographical, adjacency, population and population density) with 'n_class = 6':
import seaborn as sns
sns.clustermap(all_four_distances, xticklabels = True, yticklabels = True, figsize = (10, 10))
8.2. Clustering based on composite distance using three components viz. adjacency, population and population density with 'n_class = 6':
sns.clustermap(all_except_geog, xticklabels = True, yticklabels = True, figsize = (10, 10))
8.3. Clustering based on composite distance using three components viz. geographical distance, population and population density with 'n_class = 6':
sns.clustermap(all_except_adj, xticklabels = True, yticklabels = True, figsize = (10, 10))
8.4. Clustering based on composite distance using three components viz. geographical distance, adjacency and population density with 'n_class = 6':
sns.clustermap(all_except_pop, xticklabels = True, yticklabels = True, figsize = (10, 10))
8.5. Clustering based on composite distance using two components viz. population and population density with 'n_class = 6':
sns.clustermap(pop_pop_dens, xticklabels = True, yticklabels = True, figsize = (10, 10))
8.6. Clustering based on population density, with 'n_class = 6':
sns.clustermap(pop_dens, xticklabels = True, yticklabels = True, figsize = (10, 10))
8.7. Clustering based on composite distance using two components viz. adjacency and population density with 'n_class = 6':
sns.clustermap(pop_dens_adj, xticklabels = True, yticklabels = True, figsize = (10, 10))
8.8. Clustering based on composite distance using two components viz. geographical distance and population density with 'n_class = 6':
sns.clustermap(pop_dens_geog, xticklabels = True, yticklabels = True, figsize = (10, 10))
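A caveat on the clustermaps above: by default `sns.clustermap` treats each row of its input as an observation and computes its own distances between rows, so passing a distance matrix clusters regions on their distance profiles rather than on the distances themselves. If clustering on the composite distances directly is what is wanted, one option is to build the linkage with SciPy and cut it into a fixed number of clusters. The sketch below assumes a symmetric matrix with a zero diagonal; the helper name, the "average" linkage choice, and the toy data are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def hierarchical_labels(dist, n_clusters, method="average"):
    """Cluster directly on a precomputed distance matrix.

    Returns the linkage tree and one cluster label (1..n_clusters) per row.
    """
    dist = np.asarray(dist, dtype=float)
    # squareform converts the square matrix to the condensed form linkage expects;
    # checks=True raises if the matrix is not symmetric with a zero diagonal
    condensed = squareform(dist, checks=True)
    tree = linkage(condensed, method=method)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels

# Toy example: four points on a line forming two obvious groups
points = np.array([0.0, 1.0, 10.0, 11.0])
toy_dist = np.abs(points[:, None] - points[None, :])
toy_tree, toy_labels = hierarchical_labels(toy_dist, n_clusters=2)
```

The resulting tree can be handed to seaborn through the `row_linkage` and `col_linkage` arguments of `sns.clustermap`, so that the heatmap ordering reflects the precomputed distances.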
NOTE: There seems to be some bug in sklearn.cluster.SpectralClustering, which gets triggered when using any of the composite distance matrices other than the two below
9.1. Clustering based on composite distance using two components (population and population density) with 'n_class = 6':
from sklearn.cluster import SpectralClustering
spectral_pop_pop_dens = SpectralClustering(n_clusters=6, assign_labels="discretize",
                                           random_state=0, affinity = 'precomputed').fit(1 - pop_pop_dens)
spectral_pop_pop_dens = pd.DataFrame(zip(regions['NUTS3'], spectral_pop_pop_dens.labels_),
                                     columns = ["NUTS3","Cluster"])
spectral_pop_pop_dens
9.2. Clustering based on population density alone, with 'n_class = 6':
spectral_pop_dens = SpectralClustering(n_clusters=6, assign_labels="discretize",
                                       random_state=0, affinity = 'precomputed').fit(1 - pop_dens)
spectral_pop_dens = pd.DataFrame(zip(regions['NUTS3'], spectral_pop_dens.labels_),
                                 columns = ["NUTS3","Cluster"])
spectral_pop_dens
regions
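The cause of the SpectralClustering error noted in section 9 has not been confirmed here. Two things may be worth ruling out: a precomputed affinity must not contain NaN, and `1 - distance` becomes negative wherever a composite distance exceeds 1. The scikit-learn documentation suggests turning a distance matrix into a similarity with a Gaussian kernel instead. The sketch below follows that suggestion; the helper name and the default bandwidth (the median off-diagonal distance) are assumptions, not project choices.

```python
import numpy as np

def distance_to_affinity(dist, sigma=None):
    """Convert a distance matrix to a Gaussian-kernel affinity in (0, 1]."""
    d = np.asarray(dist, dtype=float)
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        raise ValueError("distance matrix must be square")
    if np.isnan(d).any():
        raise ValueError("distance matrix contains NaN")
    d = (d + d.T) / 2  # remove tiny asymmetries from floating-point arithmetic
    if sigma is None:
        # Assumed default bandwidth: median of the off-diagonal distances
        off_diagonal = d[~np.eye(d.shape[0], dtype=bool)]
        sigma = float(np.median(off_diagonal))
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

# Toy distances (made-up values), including one greater than 1
toy_distances = np.array([[0.0, 1.0, 2.0],
                          [1.0, 0.0, 1.0],
                          [2.0, 1.0, 0.0]])
toy_affinity = distance_to_affinity(toy_distances)
```

The result could then be passed to `SpectralClustering(..., affinity='precomputed').fit(...)` in place of `1 - distance`. Note that the cluster assignments will generally differ from those obtained with `1 - distance`, so the two approaches are not interchangeable.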